Requisites

# Load packages.
library(tidyverse)


Bayesian calibration, on the other hand, can in principle handle what history matching can’t, but it requires us to have done a good job of encoding the likelihood of those outputs in the prior. The distinction between history matching and Bayesian calibration might be pithily summarised as: “History matching is a quicker and easier way to learn about a little, while Bayesian calibration is a slower and more difficult way to learn about a lot”.

Edwards, Cameron and Rougier suggest that history matching can be a useful pre-calibration step to sensibly inform the prior for Bayesian calibration, but this is only reasonable if the desirable responses of the outputs to the inputs are centred within the Not Ruled Out Yet space. By analogy, sure, it seems sensible to start looking for your lost keys under the streetlight at night, but that doesn’t mean they aren’t actually somewhere in the dark.



********************************************

TL;DR

History-matching methods are just fancy ways to get rid of input values that give ridiculous output values. You do this by considering what kind of previously-used input values gave ridiculous output values, and deciding not to carry them forward.

For example, if you’ve ever made a meal and added too much salt, you will have learned how much salt makes for a bad meal. Next time you make the meal and you are unsure about how much salt to add, you will consider your previous (a.k.a. historical) salting and say “Well, there is definitely no point in adding more salt than I did last time”. Congratulations, you have just applied history matching to constrain the range of inputs (i.e. salt) so that the outputs (i.e. tastiness of your meal) are more likely to be tasty / desirable / sensible / realistic!

History matching is often applied to Gaussian processes, and that is what I will do in this blog too. You can read more about Gaussian processes in my other explainer, here.

(Side note: Incidentally, check out this Art of Manliness podcast episode with Daniel Holzman to learn a lot about appropriate salting and more.)



Why?

So, what is the motivation behind history matching?





Seriously, what?

To understand history matching, I found Williamson et al.’s idea of the Not Ruled Out Yet space to be useful. The Not Ruled Out Yet space is the space of inputs (i.e. the scope of possible values for the inputs) that remains after I’ve gotten rid of the input values that produce ridiculous and unlikely outputs. (Adrianakis et al. have a similar idea called the “non-implausible region”.) We can come up with any number of rules to decide what should be ruled in and ruled out. For example, the implausibility measure used by Williamson et al. is the absolute difference between the expected predicted output value and the observed output value, scaled by the standard deviation of that difference:

\[ I(\ parameter\ value_{j}\ ) = \max_{i}\{\ I_{i}(parameter\ value_{j})\ \} \]

where

\[ I_{i}(parameter\ value_{j}) = \frac{ |\ observation_{i} - E[\ prediction_{i}( parameter\ value_{j})\ ]\ | } {\sqrt{ Var[\ observation_{i} - E[\ prediction_{i}(parameter\ value_{j})\ ]\ ] }} \] and where \(i\) refers to runs of the model, and \(j\) refers to particular parameter values. You use the implausibility measure to rank parameter values, then trim off the worst.
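To make the measure concrete, here is a minimal base-R sketch. Everything in it (the observations, the emulator’s mean and variance functions, the candidate parameter values) is invented purely for illustration:

```r
# Invented observations, one per model run i.
observations <- c(2.0, 5.0, 9.0)
# Invented candidate parameter values theta_j.
parameter_values <- c(0.1, 0.5, 0.9)

# Hypothetical emulator: expected prediction for run i at parameter value
# theta, plus a variance for the difference (observation - prediction).
predict_mean <- function(i, theta) i * 3 * theta
predict_var  <- function(i, theta) 0.5 + 0.1 * i  # theta unused in this toy

# I_i(theta): standardised absolute difference for run i.
implausibility_i <- function(i, theta) {
  abs(observations[i] - predict_mean(i, theta)) / sqrt(predict_var(i, theta))
}

# I(theta): worst (maximum) implausibility across all runs i.
implausibility <- function(theta) {
  max(sapply(seq_along(observations), implausibility_i, theta = theta))
}

scores <- sapply(parameter_values, implausibility)
ranked <- parameter_values[order(scores)]  # least implausible first
```

Ranking the candidate parameter values by `scores` and trimming the worst is exactly the filtering step described above.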

Any choice of rule comes with trade-offs. The implausibility measure, for example, can be compared across different predictive models, even ones with different outputs, because it is a unitless, scaled score. But if applying it always trims the most-extreme scores, then repeated application (called “refocussing”) can produce a situation like Zeno’s dichotomy paradox, in which the Not Ruled Out Yet space shrinks towards vanishing. To prevent this, some absolute, minimum size of the Not Ruled Out Yet space needs to be set…but couldn’t we just have set an absolute, maximum accepted difference in the first place?
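To see that Zeno-like shrinkage concretely, here is a toy base-R example (the scores and the 20% trimming rule are my own inventions): each wave of refocussing discards the worst-scoring fifth of the candidates, so the retained set shrinks geometrically but never quite empties.

```r
# Toy candidates on [0, 1] with a made-up implausibility score.
candidates <- seq(0, 1, length.out = 1000)
score <- abs(candidates - 0.3)  # pretend 0.3 is the "best" input value

sizes <- integer(0)
for (wave in 1:10) {
  keep <- score <= quantile(score, 0.8)  # always drop the worst 20%
  candidates <- candidates[keep]
  score <- score[keep]
  sizes <- c(sizes, length(candidates))
}
sizes  # shrinks geometrically towards zero, but never reaches it
```

Without an absolute stopping rule, nothing in this loop ever declares the space “small enough”: it just keeps halving the distance, Zeno-style.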





History matching is also known as Monte Carlo filtering, which Saltelli et al. (2008, page 248) summarise as

“one samples the space of the input factors as in ordinary [Monte Carlo], and then categorizes the corresponding model output as either within or without the target region (the terms behaviour, \(B\), or nonbehaviour, \(\bar{B}\), are used). This categorization is then mapped back onto the input factors, each of which is thus also partitioned into a behavioural and nonbehavioural subset.”
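Saltelli et al.’s description translates into only a few lines of base R. The toy simulator and the target region below are invented for illustration:

```r
# Monte Carlo filtering sketch: sample the inputs, run the model, split the
# output into behavioural (B) and non-behavioural (B-bar), then map that
# split back onto the input factors.
set.seed(42)
n <- 10000
x1 <- runif(n, -1, 1)  # input factor 1
x2 <- runif(n, -1, 1)  # input factor 2

model <- function(x1, x2) x1^2 + x2  # toy simulator, invented

y <- model(x1, x2)
behavioural <- y >= 0 & y <= 0.5  # invented target region for the output

# Map the categorisation back onto an input factor.
x1_B    <- x1[behavioural]   # behavioural subset of x1
x1_Bbar <- x1[!behavioural]  # non-behavioural subset of x1
```

Comparing the distributions of `x1_B` and `x1_Bbar` (e.g. with a Kolmogorov–Smirnov test, as Saltelli et al. do) then tells you how strongly that input factor drives the model into or out of the target region.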



An alternative: Bayesian calibration

I think the distinction between history matching and Bayesian calibration is somewhat analogous to the distinction between fixed effects and random effects in regression analysis. The analogy I’m alluding to is that estimating regression coefficients for fixed effects is like saying “This is the estimate; other estimates are not acceptable”, and filtering parameter values using history matching is like saying “These parameter values are acceptable. Other parameter values are not acceptable”.

Similarly, estimating regression coefficients for random effects is like saying “This is the distribution of acceptable coefficients, but I’m not saying any particular one is best”, and filtering parameter values using Bayesian calibration is like saying “This is the distribution of acceptable parameter values, but I’m not saying any particular one is best”.

Of course, there are many differences between fitting regression coefficients and filtering parameter values in simulation models. I present the previous analogy to share something that helped me on my journey to understand all these concepts. Analogies are temporary rafts to help us across rivers of confusion; they need not be perfect and they can be left behind once we reach the firm understanding of the other side.

For more on Bayesian calibration, check out…



How?: Pinching your curves





Assessing whether the matching helped

A general rule for life is never to rely solely on a seller’s word about the performance of some tool. Also, never forget that, in theory, theory and practice are the same but, in practice, they aren’t. Thankfully, there are sensible ways to assess how much our history matching might have helped. As a general warning, beware of academic clickbait, i.e. manipulation by relative and absolute statistics.

  1. The absolute decrease in the count of inputs, \(n_{decrease} = n_{before} - n_{after}\).
  • Good: Simple to compute and to interpret.
  • Bad: Not a relative statistic, so it is not comparable with other methods; depends on the initial count of inputs, which was arbitrarily chosen.
  2. The relative decrease in the count of inputs, \(percentage\ decrease = 100 \times \frac{n_{before} - n_{after}}{n_{before}}\) (i.e. “X% fewer inputs”) or \(multiplicative\ decrease = \frac{n_{before}}{n_{after}}\) (i.e. “input space is X-times smaller”).
  • Good: Enables comparison with other methods.
  • Bad: Can be misleading because it depends on the initial count of inputs, which was arbitrarily chosen.
  3. Average optical depth, i.e. the \(log_{10}\) probability of encountering a non-implausible output for given inputs.
  • Good:
  • Bad:
  4. Average implausibility.
  • Good:
  • Bad:
  5. Probability of selecting a parameter set that fits all outputs.
  • Good:
  • Bad:
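As a quick base-R sketch of the first two measures (the counts are invented; note that “X-times smaller” is the ratio n_before / n_after):

```r
# Invented counts, for illustration only.
n_before <- 5000  # candidate inputs before history matching
n_after  <- 250   # inputs still in the Not Ruled Out Yet space

absolute_decrease       <- n_before - n_after                  # "4750 fewer inputs"
percentage_decrease     <- 100 * (n_before - n_after) / n_before  # "95% fewer inputs"
multiplicative_decrease <- n_before / n_after                  # "input space is 20-times smaller"
```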

Final thoughts and other resources

The dangers of selection

History matching assumes that outputs from previous models are sufficient proxy observations of the truth, so we should use them to constrain our wider understanding of the truth. But what if the outputs from our previous models poorly represent the truth? Selection bias is a beast!

I think selection bias is easiest to understand as collider bias using directed acyclic graphs, which is beyond the scope of this blog. Check out figure 1 in Griffith et al. for a simple illustration of how selection bias changes our perception of a relationship (URL in code box):

knitr::include_graphics("https://media.springernature.com/full/springer-static/image/art%3A10.1038%2Fs41467-020-19478-2/MediaObjects/41467_2020_19478_Fig1_HTML.png?as=webp")



Williamson et al. hint at selection bias when they say the following about ensemble filtering:

“The issue with this approach is that it does not account for the uncertainty in the unsampled, high dimensional parameter space”.

But the authors don’t acknowledge this issue for history matching; why not? Perhaps I am missing something. Please, let me know.

(Side note: Selection bias isn’t merely us focussing on a subset of situations. Focussing our analyses is permissible if we acknowledge that we are doing it. If you don’t realise that you have selected a subset of situations and you make statements about the whole, then your inferences are incorrect – plain and simple.)



The dangers of extrapolation

Another problem I have with history matching - and indeed any attempt to infer beyond what we actually observe - is that it assumes the outputs respond smoothly and predictably to the inputs. By disregarding all input values beyond our Not Ruled Out Yet space, we don’t allow for non-linear effects, reversals of effects, or pockets of sensible outputs beyond a gap of nonsensical ones. Below are three examples:

  • Extrapolating a linear trend from pre-1920 data underestimates global surface temperature today
# Prepare data
webscrape <- read.csv("https://data.giss.nasa.gov/gistemp/graphs/graph_data/Global_Mean_Estimates_based_on_Land_and_Ocean_Data/graph.txt")
df <-
  webscrape %>% 
  tidyr::separate_rows(colnames(webscrape), sep = ", ") %>% 
  dplyr::slice(4:n()) %>%
  tidyr::separate(colnames(webscrape), sep = "\\s+", into = c("Year", "No_Smoothing", "Lowess(5)")) %>%
  dplyr::select(-`Lowess(5)`) %>%
  dplyr::mutate_if(is.character, as.numeric)
# Add the 1951-1980 baseline value
# https://earthobservatory.nasa.gov/world-of-change/decadaltemp.php
df$No_Smoothing <- df$No_Smoothing + 14
# Fit a linear model to the pre-1920 data only.
df <-
  df %>%
  dplyr::filter(Year < 1920) %>%
  lm(formula = No_Smoothing ~ Year, data = .) %>%
  predict(newdata = df %>% dplyr::select(Year)) %>%
  as.data.frame() %>%
  `colnames<-`("trend") %>%
  dplyr::bind_cols(df)
# Plot.
(p_tempExtrap <-
    df %>%
    ggplot() +
     geom_point(aes(x = Year, y = No_Smoothing),
                size = 3, color = "grey") +
     geom_line(aes(x = Year, y = trend),
               linewidth = 3) +
    ylab("Global average\nsurface temperature (Celsius)") +
    labs(caption = "Data source: https://climate.nasa.gov/vital-signs/global-temperature/") +
    theme(text = element_text(size = 20))
    
)

knitr::include_graphics("https://bpb-eu-w2.wpmucdn.com/blogs.bristol.ac.uk/dist/e/692/files/2022/01/PlotTwo-1024x512.png")

knitr::include_graphics("https://i.makeagif.com/media/6-28-2023/O09D6n.gif")



Baby out with the bathwater

The gist of Williamson et al.’s idea of the Not Ruled Out Yet space is that we throw out the input values that produce ridiculous and unlikely outputs. But some of those input values might also have produced plausible outputs. It seems very conservative to throw out any input value that misbehaved even a little.